In this lesson we’ll learn the theory behind using Self-Organizing Maps (SOMs) as a clustering and mapping technique. We’ll then use a Self-Organizing Map to cluster the UCI wine dataset in R.
To run the code you may need to install additional packages.
install.packages("ggplot2")
install.packages("kohonen")
require(ggplot2)
## Loading required package: ggplot2
require(kohonen)
## Loading required package: kohonen
## Loading required package: class
## Loading required package: MASS
We will be using the UCI Machine Learning Repository: Wine Data Set. These data are the results of a chemical analysis of wines grown in the same region in Italy but derived from three different cultivars. The analysis determined the quantities of 13 constituents found in each of the three types of wines.
The attributes are:
1) Alcohol
2) Malic acid
3) Ash
4) Alcalinity of ash
5) Magnesium
6) Total phenols
7) Flavanoids
8) Nonflavanoid phenols
9) Proanthocyanins
10) Color intensity
11) Hue
12) OD280/OD315 of diluted wines
13) Proline
Feel free to tweet questions to [@NikBearBrown](https://twitter.com/NikBearBrown)
# Load our data
data(wines)
head(wines)
## alcohol malic acid ash ash alkalinity magnesium tot. phenols
## [1,] 13.20 1.78 2.14 11.2 100 2.65
## [2,] 13.16 2.36 2.67 18.6 101 2.80
## [3,] 14.37 1.95 2.50 16.8 113 3.85
## [4,] 13.24 2.59 2.87 21.0 118 2.80
## [5,] 14.20 1.76 2.45 15.2 112 3.27
## [6,] 14.39 1.87 2.45 14.6 96 2.50
## flavonoids non-flav. phenols proanth col. int. col. hue OD ratio
## [1,] 2.76 0.26 1.28 4.38 1.05 3.40
## [2,] 3.24 0.30 2.81 5.68 1.03 3.17
## [3,] 3.49 0.24 2.18 7.80 0.86 3.45
## [4,] 2.69 0.39 1.82 4.32 1.04 2.93
## [5,] 3.39 0.34 1.97 6.75 1.05 2.85
## [6,] 2.52 0.30 1.98 5.25 1.02 3.58
## proline
## [1,] 1050
## [2,] 1185
## [3,] 1480
## [4,] 735
## [5,] 1450
## [6,] 1290
A self-organizing map (SOM) or self-organizing feature map (SOFM) is an artificial neural network (ANN) that is trained using unsupervised learning to produce a low-dimensional (typically two-dimensional), discretized representation of the input space of the training samples, called a map. This makes SOMs useful for visualizing low-dimensional views of high-dimensional data. In this sense, SOMs are similar to multidimensional scaling. Self-organizing maps with a small number of nodes behave in a way that is similar to k-means; larger self-organizing maps rearrange data in a way that is fundamentally topological in character.
SOM Algorithm
SOM Training
Square Grid Lattice SOM
Hexagonal Honeycomb Lattice SOM
The grids have an \(x\) dimension and a \(y\) dimension. The topology of the grid is rectangular or hexagonal.
The neighborhood function \(\theta(u, v, s)\) usually depends on the lattice distance between the BMU (neuron \(u\)) and neuron \(v\). Other choices, such as a Gaussian function or a hard radius centered on the BMU, are also common.
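To make this concrete, a Gaussian neighborhood can be sketched in a few lines of R. This is an illustration only; the radius schedule `sigma0 * exp(-s / lambda)` and its constants are arbitrary choices for the sketch, not kohonen defaults:

```r
# Gaussian neighborhood: weight of node v given the BMU u, at step s.
# d_uv is the lattice distance between u and v; the radius shrinks with s.
theta <- function(d_uv, s, sigma0 = 2, lambda = 100) {
  sigma <- sigma0 * exp(-s / lambda)   # neighborhood radius decays over time
  exp(-d_uv^2 / (2 * sigma^2))
}

theta(0, s = 1)   # the BMU itself always gets weight 1
theta(2, s = 1)   # nearby nodes get smaller weights
theta(2, s = 90)  # the same node gets even less weight late in training
```

Early in training the neighborhood is wide, so distant nodes are still pulled towards the input; late in training only the BMU and its closest neighbors move appreciably.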
A. Randomize the map’s nodes’ weight vectors. The weights of the neurons are initialized either to small random values or sampled evenly from the subspace spanned by the two largest principal component eigenvectors.
B. Grab an input vector \(\mathbf{D}(t)\), where \(t\) is the index of the target input vector in the input data set \(\mathbf{D}\). Each data point \(\mathbf{D}(t)\) needs to be mapped to a “neuron.”
C. Traverse each node in the map. Use the Euclidean distance, or another similarity metric, between the input vector \(\mathbf{D}(t)\) and each node’s weight vector. The node that produces the smallest distance is the best matching unit (BMU).
D. Update the nodes in the neighborhood of the BMU (including the BMU itself) by pulling them closer to the input vector. This is analogous to updating the means to be the centroids of the observations in their clusters in k-means clustering. That is,
\[ W_v(s + 1) = W_v(s) + \theta(u, v, s)\, \alpha(s) (D(t) - W_v(s)) \]
where \(W_v(s)\) is the weight vector of node \(v\) at step \(s\), \(\alpha(s)\) is a monotonically decreasing learning rate, and \(\theta(u, v, s)\) is the neighborhood function.
E. Increase \(s\) and repeat from step B while \(s < \lambda\).
Note that the neighborhood function \(\theta(u, v, s)\) depends on the lattice distance between the BMU (neuron \(u\)) and neuron \(v\). In its simplest form it is 1 for all neurons close enough to the BMU and 0 for the others, but a Gaussian function is a common choice, too. Regardless of the functional form, the neighborhood function shrinks with time.
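The steps above can be sketched as a toy SOM in plain R. This is an illustration, not the kohonen implementation; the grid size, learning-rate schedule, and radius schedule are arbitrary choices:

```r
set.seed(1)
D <- matrix(rnorm(200 * 3), ncol = 3)      # toy data: 200 points in 3 dimensions
gx <- 5; gy <- 5                           # a 5x5 rectangular grid
grid <- expand.grid(x = 1:gx, y = 1:gy)    # lattice coordinates of the nodes
W <- matrix(rnorm(gx * gy * 3), ncol = 3)  # A. random initial weight vectors
lambda <- 2000                             # total number of training steps

for (s in 1:lambda) {
  t <- sample(nrow(D), 1)                            # B. grab an input vector D(t)
  Dt <- matrix(D[t, ], nrow(W), 3, byrow = TRUE)     # D(t) repeated for each node
  u <- which.min(rowSums((W - Dt)^2))                # C. best matching unit (BMU)
  alpha <- 0.5 * exp(-s / lambda)                    # decaying learning rate
  sigma <- 2 * exp(-s / lambda)                      # shrinking neighborhood radius
  lat2 <- (grid$x - grid$x[u])^2 + (grid$y - grid$y[u])^2
  nbh <- exp(-lat2 / (2 * sigma^2))                  # Gaussian neighborhood theta
  W <- W + nbh * alpha * (Dt - W)                    # D. pull nodes towards D(t)
}                                                    # E. repeat while s < lambda
```

After training, each row of `W` is a codebook vector; nearby grid nodes end up with similar codebook vectors, which is the topology preservation the prose describes.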
A variant algorithm:
A. Randomize the map’s nodes’ weight vectors
B. Traverse each input vector in the input data set
C. Traverse each node in the map
D. Use the Euclidean distance formula to find the similarity between the input vector and the map’s node’s weight vector
E. Track the node that produces the smallest distance (this node is the best matching unit, BMU)
F. Update the nodes in the neighborhood of the BMU (including the BMU itself) by pulling them closer to the input vector: \(W_v(s + 1) = W_v(s) + \theta(u, v, s) \alpha(s)(D(t) - W_v(s))\)
G. Increase \(s\) and repeat from step B while \(s < \lambda\)
This kind of training is called competitive learning, in which nodes compete for the right to respond to a subset of the input data.
Using self-organizing maps typically proceeds as follows:
Select the size and type of the map. The shape can be hexagonal or square. Note that a node in a hexagonal grid has six immediate neighbors, whereas a node in a square grid usually has four. The grid dimensions determine the number of “neurons.”
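The neighbor counts can be checked with kohonen’s `somgrid()` and `unit.distances()` (the latter is exported by the kohonen package; this sketch assumes a 5x5 grid, in which node 13 is an interior node):

```r
library(kohonen)

hex  <- somgrid(5, 5, "hexagonal")
rect <- somgrid(5, 5, "rectangular")

# pairwise lattice distances between the nodes of each grid
dh <- unit.distances(hex)
dr <- unit.distances(rect)

# immediate neighbours (lattice distance 1) of the interior node 13
sum(abs(dh[13, ] - 1) < 1e-6)  # 6 on the hexagonal grid
sum(abs(dr[13, ] - 1) < 1e-6)  # 4 on the rectangular grid
```

Edge and corner nodes have fewer neighbors, which is one reason boundary units can behave differently from interior units during training.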
The algorithm then
A. Initializes all node weight vectors.
B. Chooses a random data point from the training data to present to the SOM.
C. Finds the “Best Matching Unit” (BMU) in the map and determines the nodes within the “neighborhood” of the BMU. The size of the neighborhood decreases with each iteration.
D. Adjusts the weights of nodes (neurons) in the BMU neighborhood towards the chosen data point.
- The learning rate decreases with each iteration.
- The magnitude of the adjustment is proportional to the proximity of the node to the BMU.
E. Repeats steps B–D for a fixed number of iterations or until convergence.
Pros:
- Intuitive two-dimensional visualization of high-dimensional data.
- Preserves topology: similar inputs are mapped to nearby nodes.
- Simple to train and interpret.
Cons:
- The grid size and topology must be chosen in advance.
- Results depend on the random initialization and on the learning-rate and neighborhood schedules.
- Training can be slow for large datasets.
A self-organizing map projects high-dimensional data onto a two-dimensional map. The projection preserves the topology of the data, so similar data items are mapped to nearby locations on the map. As such, one of its great strengths is high-dimensional data visualization, akin to multidimensional scaling (MDS).
SOM Clusters
- from Self-Organising Maps for Customer Segmentation using R via @rbloggers
SOM Clusters with Labels
- from Self-Organising Maps for Customer Segmentation using R via @rbloggers
SOM Clusters on Map
- from Self-Organising Maps for Customer Segmentation using R via @rbloggers

## SOM Heatmaps
A heat map is a graphical representation of data in which a gradient of values is represented as colors.
SOM Heatmap
- from Self-Organising Maps for Customer Segmentation using R via @rbloggers
Clustering the data with a 5x5 hexagonal grid.
training <- sample(nrow(wines), 120)
Xtraining <- scale(wines[training, ])
# 5, 5, "hexagonal"
fit.som <- som(Xtraining, somgrid(5, 5, "hexagonal"))
map.som <- map(fit.som,
scale(wines[-training, ],
center=attr(Xtraining, "scaled:center"),
scale=attr(Xtraining, "scaled:scale")))
fit.som
## som map of size 5x5 with a hexagonal topology.
## Training data included.
map.som
## $unit.classif
## [1] 14 20 24 25 14 20 24 10 14 14 10 24 14 14 14 14 15 10 24 25 10 10 14
## [24] 2 1 2 9 17 9 8 8 7 7 3 2 9 2 2 2 9 9 9 16 6 16 17
## [47] 18 16 21 21 23 21 16 22 21 21 22
##
## $distances
## [1] 4.738448 3.322401 4.233237 3.031536 2.498157 5.030092 1.862840
## [8] 2.887436 2.935353 2.650436 2.569206 2.697843 2.796350 1.561687
## [15] 2.371052 5.579350 4.642461 1.227115 3.435748 2.636639 2.167767
## [22] 1.682796 6.206298 5.527769 5.241915 5.354827 5.555374 4.627094
## [29] 6.107042 2.399539 2.732844 2.406429 1.749908 9.345226 4.307163
## [36] 3.358718 3.552189 1.495407 3.541806 4.715874 10.913488 2.962301
## [43] 4.791160 7.679222 5.267244 3.427063 2.647084 3.680592 2.319657
## [50] 1.826881 8.051129 4.540358 4.081171 3.378866 3.998951 2.262989
## [57] 5.454054
##
## $whatmap
## [1] 1
##
## $weights
## [1] 1
##
## $scale.distances
## [1] FALSE
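Since the wine types are known, the map assignments of the held-out samples can be checked against them. A quick sketch, assuming the `vintages` class factor that `data(wines)` loads alongside `wines`:

```r
# cross-tabulate the held-out samples' map units against their known types;
# a clean mapping puts each wine type on its own group of units
table(vintages[-training], map.som$unit.classif)
```

Rows are the three cultivars and columns are map units; ideally each unit’s column is dominated by a single cultivar.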
The codebook vectors are visualized in a segments plot, which is the default plotting type. The codes plot shows, for each node, the weight vector or “codes” associated with it. That is, the fan chart drawn at each node reveals how much of each attribute characterizes the samples mapped there. High alcohol levels, for example, are associated with wine samples projected in the bottom right corner of the map, while color intensity is largest in the bottom left corner.
plot(fit.som,"codes")
This shows the mean distance to the closest codebook vector during training, here over a hundred iterations. Training decreases the mean distance with iterations; here it plateaus at around 40 iterations.
plot(fit.som,"changes")
This tells you how many items are in each of the clusters. The count plot can be used as a quality check: ideally you want a roughly uniform distribution, i.e. you would like the data not to be bunched up in a few nodes (neurons).
plot(fit.som,"counts")
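The counts behind the plot can also be read directly from the fitted model’s `unit.classif` element, which gives the node index assigned to each training sample:

```r
# number of training samples mapped to each node; nodes absent
# from the table received no samples at all
table(fit.som$unit.classif)
```

Empty or nearly empty nodes can be a sign that the grid is too large for the amount of data.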
This is sometimes called the “U-Matrix.” The U-Matrix value of a particular node is the average distance between the node’s weight vector and those of its closest neighbors. Areas of low neighbor distance indicate groups of nodes that are similar, while areas of high neighbor distance indicate natural “borders” in the map. That is, units near a class boundary can be expected to have larger average distances to their neighbors.
plot(fit.som,"dist.neighbours")
This shows the mean distance of objects mapped to a unit to the codebook vector of that unit. The smaller the distances, the better the objects are represented by the codebook vectors.
plot(fit.som,"quality")
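Beyond the built-in plot types above, a heatmap of a single attribute (a “component plane”) can be drawn with the “property” type. A sketch using the `fit.som` object from above; note that in recent kohonen versions `fit.som$codes` is a list of matrices, so the codebook matrix is its first element:

```r
codes <- fit.som$codes
if (is.list(codes)) codes <- codes[[1]]  # newer kohonen stores codes as a list

# component plane: the "alcohol" codebook value at each node, shown as colors
plot(fit.som, type = "property",
     property = codes[, "alcohol"], main = "Alcohol")
```

Drawing one such plane per attribute shows at a glance which regions of the map correspond to high or low values of each constituent.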
Clustering the data with a 3x3 rectangular grid.
fit.som <- som(Xtraining, somgrid(3, 3, "rectangular"))
map.som <- map(fit.som,
scale(wines[-training, ],
center=attr(Xtraining, "scaled:center"),
scale=attr(Xtraining, "scaled:scale")))
fit.som
## som map of size 3x3 with a rectangular topology.
## Training data included.
map.som
## $unit.classif
## [1] 5 1 5 1 1 1 1 1 5 5 5 1 5 5 5 5 5 1 1 1 5 1 8 9 6 8 9 3 8 9 9 9 9 8 8
## [36] 8 8 8 8 8 8 8 2 6 3 3 3 3 3 2 2 3 2 2 2 3 2
##
## $distances
## [1] 4.226180 3.558723 4.483021 2.718985 2.559312 6.327844 3.488351
## [8] 5.627123 2.884924 3.355415 2.494739 4.463967 3.188437 2.025838
## [15] 1.895298 4.657931 5.431654 3.257338 3.089630 3.132281 2.832086
## [22] 2.569086 6.562225 7.284487 4.364035 8.494547 6.745760 4.237486
## [29] 6.219615 2.144251 1.459316 3.866462 6.015745 9.324065 5.481112
## [36] 4.572417 4.335282 2.913032 3.407657 6.775008 14.261581 2.511079
## [43] 7.773148 6.880378 5.292597 2.611077 2.768104 4.259863 2.542468
## [50] 1.560628 18.252448 4.014467 3.193235 2.531149 3.346816 1.881971
## [57] 3.273558
##
## $whatmap
## [1] 1
##
## $weights
## [1] 1
##
## $scale.distances
## [1] FALSE
Plotting the 3x3 rectangular grid fit.
plot(fit.som)
plot(fit.som,"codes")
plot(fit.som,"changes")
plot(fit.som,"counts")
plot(fit.som,"dist.neighbours")
plot(fit.som,"quality")